Abstract: We investigate approximating joint distributions of
random processes with causal dependence tree distributions. Such
distributions are particularly useful in providing parsimonious
representations when there exist causal dynamics among processes.
By extending the results of Chow and Liu on dependence tree
approximations, we show that the best causal dependence tree
approximation is the one that maximizes the sum of directed
informations on its edges, where best is defined in terms of
minimizing the KL divergence between the original and the
approximate distribution. Moreover, we describe a low-complexity
algorithm to efficiently pick this approximate distribution.



I. Introduction 

For many problems in statistical learning, inference, and
prediction, it is desirable to find a parsimonious representation
of the full joint distribution of multiple random processes with
various interdependencies. Such an approximation of the joint
distribution lends itself to easier analysis and inference and
reduces storage requirements. More importantly, parsimonious
representations facilitate visualization and human comprehension
of data. Specifically, in situations such as network intrusion
detection, decision making in adversarial environments, and first
response tasks, where a rapid decision is required, such
representations can greatly aid situational awareness and the
decision-making process.

To facilitate analysis and visualization, graphical
representations are used to describe both the full and the
approximating distributions [1]-[10]. In such representations,
variables are represented as nodes, and undirected edges between
pairs of variables depict statistical dependence. Therefore, a
variable is statistically independent of all of the variables it
does not share an edge with [10].

One of the simplest graph structures is a tree. A tree is a
connected graph on n nodes that has n - 1 edges and consequently
has no loops. Dependence tree approximations are comparatively
simple to analyze (few dependencies are retained) and require
significantly less storage (storing the full joint requires space
exponential in the number of variables, whereas dependence trees
require linear space).

There are many choices for tree approximations, and often
a criterion, such as Kullback-Leibler (KL) divergence, is used
to define "goodness." Chow and Liu showed that the dependence
tree approximation with the minimum KL divergence is the one
that maximizes the sum of the mutual informations between
variables sharing an edge [11]. They also identified a
low-complexity algorithm, based on a minimum spanning tree
algorithm, to identify this best tree [11]. Their proposed
algorithm only requires the computation of second order
distributions (pairwise interactions) to find the best
approximation of the whole joint density.

For some learning and inference problems, it might be
desirable to have models which keep the temporal structure.
Directly applying Chow and Liu's procedure to multiple
random processes can yield approximations which do not
preserve temporal structure and which become increasingly
complex with time. This can be demonstrated with an example.
Consider the problem of identifying a simple but meaningful
summary of how car prices $\{C_1, C_2, \ldots, C_{365}\}$,
the number of cars sold $\{S_1, S_2, \ldots, S_{365}\}$, and gas sales
$\{G_1, G_2, \ldots, G_{365}\}$ in a town change over the course of a
year. Suppose we have access to the full joint distribution
$P_{C^{365},S^{365},G^{365}}(c^{365}, s^{365}, g^{365})$. One possible result is shown
in Figure 1. This figure only shows the beginning of the
processes; there are over one thousand nodes in this tree.
Even though this graph does not have many edges for the
number of nodes present (much simpler than the full joint),
it has a complicated structure, making analysis difficult. With
increasing time, it would become more complicated.

Also, this approximation, like almost all other possible
Chow and Liu approximations for this problem, does not
preserve temporal ordering. If we tried to interpret causal
dependencies between the variables shown in Figure 1, as is
done in [1]-[10], we would conclude that either a) the price
of cars on day one ($C_1$) depends on car sales on day two
($S_2$) and gas on day one ($G_1$) depends on the price of cars
on day two ($C_2$), or b) car sales on day one ($S_1$) depends
on the number of cars sold on day two ($C_2$). In either case,
a process on day one is seen to depend on another process
on day two. For real-world examples with causal dynamics,
the present might depend on the past, but not the future.
While this approximation can easily be used to infer correlative
influences, it might be difficult to infer causal influences from
it.

Fig. 1. A possible result of applying Chow and Liu's work to the example
of car prices, car sales, and gas sales ($C^{365}$, $S^{365}$, and $G^{365}$, respectively)
over a year. With over one thousand variables, the structure gets increasingly
complex with time. Most importantly, even though the system dynamics are
causal, the tree approximation is not.

Fig. 2. A possible result of applying Chow and Liu's work to the
example of car prices, car sales, and gas sales ($C^{365}$, $S^{365}$, and $G^{365}$,
respectively) over a year, where each process is treated as a random object.
The graphical complexity is low and does not grow with time. However, no
causal relationships can be inferred, only correlative ones.

Fig. 3. A possible causal dependence tree approximation for the example of
car prices, car sales, and gas sales ($C^{365}$, $S^{365}$, and $G^{365}$, respectively) over
a year. The graphical complexity is low and does not grow with time. The
dependence tree is causal, which is important since the underlying system
being approximated is also causal.

Although directly applying the Chow and Liu procedure to 
multiple random processes might result in an approximation 
with undesirable properties, there is an alternative way to apply 
the procedure. Consider treating each process as a random 
object. A possible Chow and Liu approximation of this for the 
example above is shown in Figure 2. With this technique, the
complexity is low for all time and the processes are kept intact. 
Consequently, inferring relationships between the processes is 
much simpler. However, since all of the time steps are kept 
together, still no causal influences can be inferred and only 
correlative relationships can be recovered. 

II. Our contribution and related work 

A. Our Contribution 

In this paper, we develop a procedure similar to Chow and 
Liu's, but in the context of random processes. Our approach 
is motivated by approximating real world dynamical systems, 
where there are physical, causal relationships. Our approach 
recovers a parsimonious causal tree representation that approximates
the original system dynamics. The goodness of the
approximation is measured by KL divergence. We show that 
the causal dependence tree approximation with the minimum 
KL divergence is the one that maximizes the sum of the 
pairwise directed informations between processes sharing an 
edge. This allows us to present a low complexity maximum 
weight directed spanning tree algorithm for calculating the best 
approximate causal tree. 

Such a tree, as demonstrated in Figure 3 for the example
regarding car prices, car sales, and gas sales, can be represented
graphically with directed edges corresponding to the direction
of influence. Besides maintaining the causal dynamics, which
is a property of most real systems, our proposed approach
does not suffer from quick growth of complexity with time, as
do [1]-[11] (Figure 1), since it works with random processes
that are not intermixed, as in Figure 2.

B. Related work 

There is a large body of work on approximating joint
distributions with probabilistic graphical models, which are
often called Bayesian networks [1]-[10]. Chow and Liu were
the first researchers in this field to investigate tree
approximations [11] for discrete random variables. Suzuki extended
the result to general random variables [12]. Carvalho and Oliveira
considered Chow and Liu's problem for metrics other than KL
divergence [13]. Meila and Jordan generalized the Chow and
Liu procedure to find the best mixture-of-trees approximation
[14]. Choi et al. developed methods based on Chow and Liu's
to learn dependence tree approximations of distributions with
hidden variables [15].

The work in Bayesian networks largely addresses correlative
relationships, not causal ones. There has been work in
developing methods to identify statistically causal relationships
between processes. When the processes can be modeled by
multivariate auto-regressive models, Yuan and Lin developed a
method, "Group Lasso," which can be used to infer the causal
relationships [16]. Bolstad et al. recently showed conditions
under which the estimates of Group Lasso are consistent
and proposed modifications to improve the reliability [17].
Materassi has developed methods based on Wiener filtering to
infer statistically causal influences in linear dynamic systems.
Consistency results have been derived for the case when the
underlying dynamics have a tree structure [18], [19].



Granger proposed a widely adopted framework for identifying
causal influences based on statistical prediction [20]. A number
of quantitative measures based on this framework have been
proposed. Many build on Granger's original measure, which is
based on linear models [20], but they will not be referenced
here. In the context of dynamical systems, Marinazzo et al.
developed a measure of Granger causality based on kernel methods
for multiple processes [21]. Massey and Rissanen independently
proposed a measure, directed information [22], [23], which is
based on earlier work by Marko [24]. Solo presented an
alternative measure of statistical causality, similar to directed
information, which uses analysis of deviance [25].

There have been some applications of directed information.
Quinn et al. used directed information estimates to infer
causal relationships between simultaneously recorded
neurons [26]. Rao et al. used directed information estimates
to infer causal relationships in gene regulatory networks [27].
In addition to its use in identifying statistically causal
influences, directed information also plays a fundamental role in
communication with feedback [23], [24], [28]-[31], prediction
with causal side information [22], [26], gambling with causal
side information [32], [33], control over noisy channels [29],
[34]-[37], and source coding with feed forward [33], [38].



C. Paper organization 

The paper organization is as follows. In Section III, we 
establish definitions and notations. In Section IV, we discuss 
the problem setup of developing meaningful approximations 
for a joint distribution of random variables and review the 
result of Chow and Liu [11]. In Section V, we discuss
approximating dynamical systems to motivate our approach to 
solving the problem. In Section VI, we present our main result 
of finding the causal dependence tree approximation which 
best approximates the full joint with respect to KL divergence. 
In Section VII, we discuss a low complexity algorithm to 
identify this best causal dependence tree approximation. In 
Section VIII, we analyze properties of causal dependence 
trees, such as the number of variable dependencies kept and 
storage requirements, as compared to the full joint distribution 
and Chow and Liu dependence tree approximations. In Section IX,
we evaluate the performance of causal dependence
tree approximations in a binary hypothesis test example, in 
comparison with the full distributions and Chow and Liu 
dependence tree approximations. 

III. Definitions and Notation 

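This section presents probabilistic notations and
information-theoretic definitions and identities that will
be used throughout the remainder of the manuscript. Unless
otherwise noted, the definitions and identities come from
Cover & Thomas.

For a sequence $a_1, a_2, \ldots$, denote $a_i^j$ as $(a_i, \ldots, a_j)$ and
$a^k \triangleq a_1^k$.

Denote the set of permutations $\pi$ on $\{1, \ldots, m\}$ as $\Pi(m)$.

For any Borel space $\mathsf{Z}$, denote its Borel sets by $\mathcal{B}(\mathsf{Z})$ and
the space of probability measures on $(\mathsf{Z}, \mathcal{B}(\mathsf{Z}))$ as $\mathcal{P}(\mathsf{Z})$.

Consider two probability measures $P$ and $Q$ in $\mathcal{P}(\mathsf{Z})$.
$P$ is absolutely continuous with respect to $Q$ ($P \ll Q$)
if $Q(A) = 0$ implies that $P(A) = 0$ for all $A \in \mathcal{B}(\mathsf{Z})$.
If $P \ll Q$, denote the Radon-Nikodym derivative as the
random variable $\frac{dP}{dQ} : \mathsf{Z} \to \mathbb{R}$ that satisfies

$$P(A) = \int_{z \in A} \frac{dP}{dQ}(z)\, Q(dz), \quad A \in \mathcal{B}(\mathsf{Z}).$$

The Kullback-Leibler divergence between $P \in \mathcal{P}(\mathsf{Z})$ and
$Q \in \mathcal{P}(\mathsf{Z})$ is defined as

$$D(P \| Q) \triangleq E_P\left[\log \frac{dP}{dQ}\right] = \int_{\mathsf{Z}} \log \frac{dP}{dQ}(z)\, P(dz) \tag{1}$$

if $P \ll Q$ and $\infty$ otherwise.

Throughout this paper, we will consider m random
processes where the ith (with $i \in \{1, \ldots, m\}$) random
process at time j (with $j \in \{1, \ldots, n\}$) takes values in
a Borel space $\mathsf{X}$.

For a sample space $\Omega$, sigma-algebra $\mathcal{F}$, and probability
measure $\mathbb{P}$, denote the probability space as $(\Omega, \mathcal{F}, \mathbb{P})$.

Denote the ith random variable at time j by $X_{i,j} : \Omega \to \mathsf{X}$,
the ith random process as $\mathbf{X}_i = (X_{i,1}, \ldots, X_{i,n})^T : \Omega \to \mathsf{X}^n$,
and the whole collection of all m random
processes as $\mathbf{X} = (\mathbf{X}_1, \ldots, \mathbf{X}_m)^T : \Omega \to \mathsf{X}^{mn}$.

The probability measure $\mathbb{P}$ thus induces a probability
distribution on $X_{i,j}$ given by $P_{X_{i,j}}(\cdot) \in \mathcal{P}(\mathsf{X})$, a joint
distribution on $\mathbf{X}_i$ given by $P_{\mathbf{X}_i}(\cdot) \in \mathcal{P}(\mathsf{X}^n)$, and a joint
distribution on $\mathbf{X}$ given by $P_{\mathbf{X}}(\cdot) \in \mathcal{P}(\mathsf{X}^{mn})$.

With slight abuse of notation, denote $\mathbf{X} \equiv \mathbf{X}_i$ for some i
and $\mathbf{Y} \equiv \mathbf{X}_j$ for some $j \neq i$, and denote the conditional
distribution and causally conditioned distribution of $\mathbf{Y}$
given $\mathbf{X}$ as

$$P_{\mathbf{Y}|\mathbf{X}=\mathbf{x}}(d\mathbf{y}) \triangleq P_{\mathbf{Y}|\mathbf{X}}(d\mathbf{y} \mid \mathbf{x}) = \prod_{i=1}^{n} P_{Y_i | Y^{i-1}, X^n}(dy_i \mid y^{i-1}, x^n) \tag{2}$$

$$P_{\mathbf{Y}\|\mathbf{X}=\mathbf{x}}(d\mathbf{y}) \triangleq P_{\mathbf{Y}\|\mathbf{X}}(d\mathbf{y} \,\|\, \mathbf{x}) \triangleq \prod_{i=1}^{n} P_{Y_i | Y^{i-1}, X^i}(dy_i \mid y^{i-1}, x^i). \tag{3}$$

Note the similarity with regular conditioning in (2),
except in causal conditioning the future ($x_{i+1}^n$) is not
conditioned on [28]. The notation for $P_{\mathbf{Y}|\mathbf{X}=\mathbf{x}}$ and
$P_{\mathbf{Y}\|\mathbf{X}=\mathbf{x}}$ is used to emphasize that $P_{\mathbf{Y}|\mathbf{X}=\mathbf{x}} \in \mathcal{P}(\mathsf{X}^n)$
and $P_{\mathbf{Y}\|\mathbf{X}=\mathbf{x}} \in \mathcal{P}(\mathsf{X}^n)$.

The mutual information and directed information [23]
between random process $\mathbf{X}$ and random process $\mathbf{Y}$ are
given by

$$I(\mathbf{X}; \mathbf{Y}) = \int_{\mathsf{X}^n} D\left(P_{\mathbf{Y}|\mathbf{X}=\mathbf{x}} \,\|\, P_{\mathbf{Y}}\right) P_{\mathbf{X}}(d\mathbf{x}) \tag{4}$$

$$I(\mathbf{X} \to \mathbf{Y}) = \int_{\mathsf{X}^n} D\left(P_{\mathbf{Y}\|\mathbf{X}=\mathbf{x}} \,\|\, P_{\mathbf{Y}}\right) P_{\mathbf{X}}(d\mathbf{x}). \tag{5}$$

Conceptually, mutual information and directed information
are related. However, while mutual information quantifies
statistical correlation (in the colloquial sense of
statistical interdependence), directed information quantifies
statistical causation. For example, $I(\mathbf{X}; \mathbf{Y}) = I(\mathbf{Y}; \mathbf{X})$,
but $I(\mathbf{X} \to \mathbf{Y}) \neq I(\mathbf{Y} \to \mathbf{X})$ in general.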
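
The asymmetry of directed information can be checked numerically on a toy example. The following is a minimal sketch, assuming finite alphabets and a joint pmf stored as a dictionary keyed by pairs of tuples; the function names and the delayed-copy example (Y_2 = X_1, everything else fair coin flips) are illustrative assumptions, not part of the paper. It uses the equivalent form of the directed information as the expected log-ratio of p(y_i | y^{i-1}, x^i) to p(y_i | y^{i-1}), summed over i.

```python
import itertools
import math
from collections import defaultdict


def marginal(joint, kx, ky):
    """Marginal pmf of (x^kx, y^ky) from a joint pmf over (x^n, y^n)."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x[:kx], y[:ky])] += p
    return out


def directed_information(joint, n):
    """I(X -> Y) in nats: sum over i of E[log p(y_i|y^{i-1},x^i) / p(y_i|y^{i-1})]."""
    total = 0.0
    for i in range(1, n + 1):
        p_xi_yi = marginal(joint, i, i)        # p(x^i, y^i)
        p_xi_yprev = marginal(joint, i, i - 1) # p(x^i, y^{i-1})
        p_yi = marginal(joint, 0, i)           # p(y^i)
        p_yprev = marginal(joint, 0, i - 1)    # p(y^{i-1})
        for (x, y), p in p_xi_yi.items():
            if p > 0:
                num = p / p_xi_yprev[(x, y[:-1])]
                den = p_yi[((), y)] / p_yprev[((), y[:-1])]
                total += p * math.log(num / den)
    return total


def mutual_information(joint):
    """I(X; Y) in nats between the two whole sequences."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)


# Hypothetical example with n = 2: X_1, X_2, Y_1 are independent fair bits
# and Y_2 copies X_1, so influence flows only from X to Y.
joint = {}
for x1, x2, y1 in itertools.product([0, 1], repeat=3):
    joint[((x1, x2), (y1, x1))] = 1 / 8
# Swap the roles of the two processes to evaluate the reverse direction.
joint_rev = {(y, x): p for (x, y), p in joint.items()}

di_xy = directed_information(joint, 2)
di_yx = directed_information(joint_rev, 2)
mi = mutual_information(joint)
```

In this example the mutual information is symmetric by construction, while the directed information from X to Y equals log 2 and the directed information from Y to X is zero.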



IV. Background: Chow and Liu Dependence Tree 
Approximations 

Consider the scenario where there are m random processes
and there is no time axis (i.e., n = 1). Then this becomes a
set of just m random variables $X^m = \{X_1, X_2, \ldots, X_m\}$ on
$\mathsf{X}^m$. Note that the chain rule is given by

$$P_{X^m}(dx^m) = \prod_{i=1}^{m} P_{X_i | X^{i-1}}(dx_i \mid x^{i-1}), \tag{6}$$

where $X^0 = \emptyset$ and (6) holds with the variables reordered by any
permutation $\pi \in \Pi(m)$.

Fig. 4. Diagram of an approximating dependence tree structure. In this example,
$\widehat{P}_{X^6}(dx^6) = P_{X_6}(dx_6) P_{X_1|X_6}(dx_1|x_6) P_{X_3|X_6}(dx_3|x_6) P_{X_4|X_3}(dx_4|x_3) P_{X_2|X_3}(dx_2|x_3) P_{X_5|X_2}(dx_5|x_2)$.

Chow and Liu developed an algorithm to approximate a full
joint distribution by a product of second order distributions
[11]. For their procedure, the chain rule is applied to the joint
distribution, and each individual term in the product (6) is
approximated as $P_{X_{\pi(i)} | X_{\pi(l(i))}}(dx_{\pi(i)} \mid x_{\pi(l(i))})$, where
$l(i) \in \{1, \ldots, i-1\}$, such that the conditioning is on at most one
variable. This approximation corresponds to a dependence tree
structure (see Figure 4). Each choice of $\pi(\cdot)$ and $l(\cdot)$ over
$\{1, \ldots, m\}$ completely specifies a tree structure T. Denote
the set of all possible trees by $\mathcal{T}$ and the tree approximation
of $P_{X^m}(dx^m)$ using $T \in \mathcal{T}$ by $\widehat{P}_{X^m}(dx^m)$:

$$\widehat{P}_{X^m}(dx^m) \triangleq \prod_{i=1}^{m} P_{X_{\pi(i)} | X_{\pi(l(i))}}(dx_{\pi(i)} \mid x_{\pi(l(i))}). \tag{7}$$

Chow and Liu's method obtains the "best" tree $T \in \mathcal{T}$,
where the "goodness" is defined in terms of KL distance
between the original distribution and the approximating
distribution. They show the important property [11]:

Theorem 1:

$$\arg\min_{T \in \mathcal{T}} D\left(P_{X^m} \,\|\, \widehat{P}_{X^m}\right) = \arg\max_{T \in \mathcal{T}} \sum_{i=1}^{m} I\left(X_{\pi(i)}; X_{\pi(l(i))}\right). \tag{8}$$

See [11] for the original proof for discrete random variables,
and [12] for a proof for general random variables. The
optimization objective is equivalent to maximizing a sum of
mutual informations. Thus, a global minimization is equivalent
to (coupled) local maximizations.

They also propose an efficient algorithm to identify this
approximating tree by calculating the mutual information
between each pair of random variables and assigning those
values as weights in the corresponding dependency graph [11].
Finding the dependence tree distribution that maximizes the
sum (8) is equivalent to finding a tree of maximal weight in the
underlying weighted graph [11]. Kruskal's minimum spanning
tree algorithm [40] can be used for this [11]. The total runtime
of this procedure is $O(m^2)$, where m is the number of random
variables (vertices in the graph).
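
The tree-selection step can be sketched as follows. This is a minimal illustration, assuming the pairwise mutual informations have already been computed and are supplied as a dictionary keyed by unordered index pairs; the function name `max_weight_spanning_tree` and the example weights are hypothetical. It runs Kruskal's procedure on edges sorted by decreasing weight, using a simple union-find structure to avoid cycles.

```python
def max_weight_spanning_tree(m, weights):
    """Kruskal's algorithm on edges in decreasing weight order.

    m: number of variables (vertices 0..m-1).
    weights: dict mapping (i, j) to the mutual information I(X_i; X_j).
    Returns the list of selected undirected edges.
    """
    parent = list(range(m))

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for (i, j), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:          # adding (i, j) does not create a loop
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Hypothetical mutual information values for four variables.
example_weights = {
    (0, 1): 0.9, (1, 2): 0.8, (0, 2): 0.7,
    (2, 3): 0.1, (0, 3): 0.05, (1, 3): 0.02,
}
example_tree = max_weight_spanning_tree(4, example_weights)
```

Here the edge (0, 2) is skipped even though its weight is large, because vertices 0 and 2 are already connected through vertex 1.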



A significant aspect of this result is that only the pairwise
interactions need to be known or estimated in order to find
the best approximation for the full joint. In many cases,
the statistics of the data are initially unknown. Chow and
Liu's procedure is particularly beneficial when the number of
variables is large and, consequently, estimating the full joint
distribution is prohibitive. A simple estimation scheme using
empirical frequencies of i.i.d. data is described in [11].
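
A plug-in estimate of a pairwise mutual information from empirical frequencies might look like the sketch below. It assumes discrete-valued i.i.d. samples given as two equal-length lists; the function name is hypothetical, and the scheme in [11] may differ in its details.

```python
import math
from collections import Counter


def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts of X
    py = Counter(ys)            # marginal counts of Y
    # Each term is p(a, b) * log( p(a, b) / (p(a) p(b)) ), written with counts.
    return sum(c / n * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


# Identical samples carry one bit; independent-looking samples carry none.
mi_copy = empirical_mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_indep = empirical_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Estimates of this kind, computed for every pair of variables, would supply the edge weights for the spanning tree step.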

In [41], the authors show that if the joint distribution has a
dependence tree structure, and if a sufficiently large number of
i.i.d. samples are used, then with probability one the estimated
tree will be the true joint. Recently, researchers have performed
an error exponent analysis for estimating joint distributions
with dependence tree structures. They showed that the error
exponent of the probability of the estimated tree structure
differing from the true tree structure is equal to the exponential
rate of decay of a single dominant "crossover" event [42], [43].
This event occurs when a pair of non-neighbor nodes in the
true tree structure share an edge in the estimated tree structure.

V. Motivating Example: Approximating the 
Structure of Dynamical Systems 

As discussed in the introduction, there are potential problems
with Chow and Liu dependence tree approximations: the
processes could be intermixed, temporal structure might
not be kept, and complexity could increase with time.
We now consider how to not only keep the processes unmixed
and complexity low, but also to identify causal dependencies
between the processes. To gain intuition for how to approach
this problem, we consider the structurally analogous problem
of approximating real-world dynamical systems, which evolve
through time.

Consider approximating a physical, dynamical system. Such
a system evolves causally with time according to a set of
coupled differential equations. Specifically, consider a system
with three processes, $\{x_t, y_t, z_t\}$, which evolve according to:

$$\begin{aligned}
x_{t+\Delta} &= x_t + \Delta\, g_1(x^t, y^t, z^t) \\
y_{t+\Delta} &= y_t + \Delta\, g_2(x^t, y^t, z^t) \\
z_{t+\Delta} &= z_t + \Delta\, g_3(x^t, y^t, z^t).
\end{aligned}$$

The causal dependencies can be depicted graphically
(see Figure 5(a)). We can approximate this dynamical system
by approximating the functions
$\{g_1(x^t, y^t, z^t), g_2(x^t, y^t, z^t), g_3(x^t, y^t, z^t)\}$ and using
fewer inputs. For example, approximate $g_1(x^t, y^t, z^t)$ with a
function $g'_1(x^t)$. One approximation for the system is:

$$\begin{aligned}
x_{t+\Delta} &= x_t + \Delta\, g_1(x^t, y^t, z^t) \approx x_t + \Delta\, g'_1(x^t) \\
y_{t+\Delta} &= y_t + \Delta\, g_2(x^t, y^t, z^t) \approx y_t + \Delta\, g'_2(x^t, y^t) \\
z_{t+\Delta} &= z_t + \Delta\, g_3(x^t, y^t, z^t) \approx z_t + \Delta\, g'_3(y^t, z^t).
\end{aligned}$$

Figure 5(b) depicts the corresponding causal dependence tree
structure for these coupled differential equations.

A similar procedure can be used for stochastic processes,
where the system is described in a time-evolving manner
through conditional probabilities. Consider three processes
$\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$, formed by including i.i.d. noises $\{\epsilon_i, \epsilon'_i, \epsilon''_i\}_{i=1}^{n}$ in
the above dynamical system and relabeling the time indices
(up to time n):

$$\begin{aligned}
X_{i+1} &= X_i + \Delta\, g_1(X^i, Y^i, Z^i) + \epsilon_i \\
Y_{i+1} &= Y_i + \Delta\, g_2(X^i, Y^i, Z^i) + \epsilon'_i \\
Z_{i+1} &= Z_i + \Delta\, g_3(X^i, Y^i, Z^i) + \epsilon''_i.
\end{aligned}$$

Fig. 5. Dependence tree structures for the dynamical system. (a) Full causal
dependence structure. (b) Causal dependence tree approximation.

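The system can alternatively be described through the joint
distribution

$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(d\mathbf{x}, d\mathbf{y}, d\mathbf{z}) = \prod_{i=1}^{n} P_{X_i, Y_i, Z_i | X^{i-1}, Y^{i-1}, Z^{i-1}}(dx_i, dy_i, dz_i \mid x^{i-1}, y^{i-1}, z^{i-1}).$$

Because of the causal structure of the dynamical system, given
the full past, the present values are conditionally independent:

$$\begin{aligned}
P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(d\mathbf{x}, d\mathbf{y}, d\mathbf{z}) = \prod_{i=1}^{n}\; & P_{X_i | X^{i-1}, Y^{i-1}, Z^{i-1}}(dx_i \mid x^{i-1}, y^{i-1}, z^{i-1}) \\
\times\; & P_{Y_i | X^{i-1}, Y^{i-1}, Z^{i-1}}(dy_i \mid x^{i-1}, y^{i-1}, z^{i-1}) \\
\times\; & P_{Z_i | X^{i-1}, Y^{i-1}, Z^{i-1}}(dz_i \mid x^{i-1}, y^{i-1}, z^{i-1}).
\end{aligned}$$

Rewrite this using the notation of causal conditioning (3)
introduced by Kramer:

$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(d\mathbf{x}, d\mathbf{y}, d\mathbf{z}) = P_{\mathbf{X}\|\mathbf{Y},\mathbf{Z}}(d\mathbf{x} \,\|\, \mathbf{y}, \mathbf{z})\, P_{\mathbf{Y}\|\mathbf{X},\mathbf{Z}}(d\mathbf{y} \,\|\, \mathbf{x}, \mathbf{z})\, P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(d\mathbf{z} \,\|\, \mathbf{x}, \mathbf{y}).$$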

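The dependence structure of this stochastic system is still
represented by Figure 5(a). We can apply a similar approximation
to this system as before, corresponding to the structure of
Figure 5(b), with:

$$\begin{aligned}
P_{\mathbf{X}\|\mathbf{Y},\mathbf{Z}}(d\mathbf{x} \,\|\, \mathbf{y}, \mathbf{z}) &\approx P_{\mathbf{X}}(d\mathbf{x}) \\
P_{\mathbf{Y}\|\mathbf{X},\mathbf{Z}}(d\mathbf{y} \,\|\, \mathbf{x}, \mathbf{z}) &\approx P_{\mathbf{Y}\|\mathbf{X}}(d\mathbf{y} \,\|\, \mathbf{x}) \\
P_{\mathbf{Z}\|\mathbf{X},\mathbf{Y}}(d\mathbf{z} \,\|\, \mathbf{x}, \mathbf{y}) &\approx P_{\mathbf{Z}\|\mathbf{Y}}(d\mathbf{z} \,\|\, \mathbf{y}).
\end{aligned}$$

Thus, our causal dependence tree approximation to these
stochastic processes, denoted by $\widehat{P}_{\mathbf{X},\mathbf{Y},\mathbf{Z}}$, is:

$$P_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(d\mathbf{x}, d\mathbf{y}, d\mathbf{z}) \approx \widehat{P}_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(d\mathbf{x}, d\mathbf{y}, d\mathbf{z}) \triangleq P_{\mathbf{X}}(d\mathbf{x})\, P_{\mathbf{Y}\|\mathbf{X}}(d\mathbf{y} \,\|\, \mathbf{x})\, P_{\mathbf{Z}\|\mathbf{Y}}(d\mathbf{z} \,\|\, \mathbf{y}).$$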

Note that, with this type of approximation, the processes are 
not mixed together and, since the nodes represent processes, 
not individual variables, the graphical complexity remains 
low. Another important characteristic is that the system we 
are approximating is causal and our approximation is causal, 
which might not have been the case if the Chow and Liu 
algorithm was applied. We now consider the problem of 
finding the best causal dependence tree approximation using 
KL divergence as a measure of goodness. 




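Fig. 6. Diagram of an approximating causal dependence tree structure. In
this example,
$\widehat{P}_{\mathbf{X}}(d\mathbf{x}) = P_{\mathbf{X}_6}(d\mathbf{x}_6)\, P_{\mathbf{X}_1\|\mathbf{X}_6}(d\mathbf{x}_1 \,\|\, \mathbf{x}_6)\, P_{\mathbf{X}_3\|\mathbf{X}_6}(d\mathbf{x}_3 \,\|\, \mathbf{x}_6)\, P_{\mathbf{X}_4\|\mathbf{X}_3}(d\mathbf{x}_4 \,\|\, \mathbf{x}_3)\, P_{\mathbf{X}_2\|\mathbf{X}_3}(d\mathbf{x}_2 \,\|\, \mathbf{x}_3)\, P_{\mathbf{X}_5\|\mathbf{X}_2}(d\mathbf{x}_5 \,\|\, \mathbf{x}_2)$.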



VI. Main Result: Causal Dependence Tree 
Approximations 

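Consider the joint distribution $P_{\mathbf{X}}$ of m random processes
$\{\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_m\}$, each of length n. For a given tree T
(defined by the functions $\pi(i)$ and $l(i)$ over the index set of the
processes $i \in \{1, \ldots, m\}$), denote the corresponding causal
dependence tree approximation as

$$\widehat{P}_{\mathbf{X}}(d\mathbf{x}) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_{\pi(i)} \| \mathbf{X}_{\pi(l(i))}}\left(d\mathbf{x}_{\pi(i)} \,\|\, \mathbf{x}_{\pi(l(i))}\right). \tag{9}$$

An example of an approximating causal dependence tree,
depicted as a directed tree, is shown in Figure 6.

As in Chow and Liu's work, KL divergence will be used
to measure the "goodness" of the approximations. Let $\widehat{P}_{\mathbf{X}}(\mathbf{x})$
denote the causal dependence tree approximation of $P_{\mathbf{X}}(\mathbf{x})$
for tree T. Let $\mathcal{T}_C$ denote the set of all causal dependence tree
approximations for $P_{\mathbf{X}}(\mathbf{x})$ and let $\widetilde{P}_{\mathbf{X}}(\mathbf{x})$ denote the product
distribution

$$\widetilde{P}_{\mathbf{X}}(d\mathbf{x}) \triangleq \prod_{i=1}^{m} P_{\mathbf{X}_i}(d\mathbf{x}_i) \tag{10}$$

$$= \prod_{i=1}^{m} P_{\mathbf{X}_{\pi(i)}}(d\mathbf{x}_{\pi(i)}), \tag{11}$$

which is equivalent to $P_{\mathbf{X}}(\mathbf{x})$ when the processes are
statistically independent. Note that (11) holds for any permutation
$\pi \in \Pi(m)$.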

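The following result for the causal dependence tree that
minimizes the KL divergence holds:

Theorem 2:

$$\arg\min_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} D\left(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}\right) = \arg\max_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} \sum_{i=1}^{m} I\left(\mathbf{X}_{\pi(l(i))} \to \mathbf{X}_{\pi(i)}\right). \tag{12}$$

Proof: Note that $P_{\mathbf{X}}$, $\widehat{P}_{\mathbf{X}}$, $\widetilde{P}_{\mathbf{X}}$ all lie in $\mathcal{P}(\Omega)$, and
moreover, $P_{\mathbf{X}} \ll \widehat{P}_{\mathbf{X}} \ll \widetilde{P}_{\mathbf{X}}$. Thus, the Radon-Nikodym
derivative $\frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}}$ satisfies the chain rule [44]:

$$\frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}} = \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}}\, \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}.$$

Taking the logarithm on both sides and rearranging terms
results in:

$$\log \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}} = \log \frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}} - \log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}.$$

Thus,

$$\begin{aligned}
\arg\min_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} D\left(P_{\mathbf{X}} \,\|\, \widehat{P}_{\mathbf{X}}\right)
&= \arg\min_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} E_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widehat{P}_{\mathbf{X}}}\right] \\
&= \arg\min_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} E_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] - E_{P_{\mathbf{X}}}\left[\log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] \\
&= \arg\max_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} E_{P_{\mathbf{X}}}\left[\log \frac{d\widehat{P}_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right] && (13) \\
&= \arg\max_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} \sum_{i=1}^{m} \int \log \frac{dP_{\mathbf{X}_{\pi(i)} \| \mathbf{X}_{\pi(l(i))} = \mathbf{x}_{\pi(l(i))}}}{dP_{\mathbf{X}_{\pi(i)}}}\left(\mathbf{x}_{\pi(i)}\right) P_{\mathbf{X}}(d\mathbf{x}) && (14) \\
&= \arg\max_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} \sum_{i=1}^{m} \int D\left(P_{\mathbf{X}_{\pi(i)} \| \mathbf{X}_{\pi(l(i))} = \mathbf{x}} \,\|\, P_{\mathbf{X}_{\pi(i)}}\right) P_{\mathbf{X}_{\pi(l(i))}}(d\mathbf{x}) && (15) \\
&= \arg\max_{\widehat{P}_{\mathbf{X}} \in \mathcal{T}_C} \sum_{i=1}^{m} I\left(\mathbf{X}_{\pi(l(i))} \to \mathbf{X}_{\pi(i)}\right), && (16)
\end{aligned}$$

where (13) follows from $E_{P_{\mathbf{X}}}\left[\log \frac{dP_{\mathbf{X}}}{d\widetilde{P}_{\mathbf{X}}}\right]$ not depending on $\widehat{P}_{\mathbf{X}}$; (14)
follows from (9) and (11); (15) follows from (1); and (16)
follows from (5).

■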
Thus, finding the optimal causal dependence tree in terms
of KL distance is equivalent to maximizing a sum of directed
informations. Also note that when n = 1, there is an
equivalence between this and Chow and Liu's result:

Corollary 3: When n = 1, Theorem 2 reduces to Theorem 1.

Similar to Chow and Liu's result, only the pairwise
interactions between the processes need to be known or estimated
to identify the best approximation for the whole joint. Two
estimators for directed information from one process to another
have recently been proposed. A parametric approach based on
the law of large numbers for Markov chains and minimum
description length is presented in [26]. A universal estimation
approach based on context weighting trees is presented in [45].

VII. A Low Complexity Algorithm for Finding the 
Optimal Causal Dependence Tree 

In Chow and Liu's work, Kruskal's minimum spanning
tree algorithm performs the optimization procedure efficiently,
after having computed the mutual information between each
pair of variables [11]. A similar procedure can be done in this
setting. First, compute the directed information between each
ordered pair of processes. This can be represented as a graph,
where each of the nodes represents a process. This graph will
have a directed edge from each node to every other node (thus
it is a complete, directed graph), and the value of the edge from
node $\mathbf{X}$ to node $\mathbf{Y}$ will be $I(\mathbf{X} \to \mathbf{Y})$.

There are several efficient algorithms that can be used
to find the maximum weight (sum of directed informations)
directed tree of a directed graph [46], such as that of Chu and
Liu [47] (which was independently discovered by Edmonds [48] and
Bock [49]) and a distributed algorithm by Humblet [50]. Note
that in some implementations, a root is required a priori. For
those, the implementation would need to be applied for each
node in the graph as a root, and then the directed tree that
has maximal weight among all of those would be selected.
Chu and Liu's algorithm has a runtime of $O(m^2)$ [46]. The
total runtime of this procedure is $O(m^3)$.

Fig. 7. The variable dependence structures for a full joint distribution, a
Chow and Liu dependence tree approximation, and a causal dependence tree
approximation, for a set of three random processes with four timesteps.
(a) Variable dependence structure for a full joint distribution.
(b) Variable dependence structure for a particular Chow and Liu dependence
tree approximation. (c) Variable dependence structure for a causal dependence
tree approximation
$\widehat{P}_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(d\mathbf{x}, d\mathbf{y}, d\mathbf{z}) = P_{\mathbf{X}}(d\mathbf{x})\, P_{\mathbf{Y}\|\mathbf{X}}(d\mathbf{y} \,\|\, \mathbf{x})\, P_{\mathbf{Z}\|\mathbf{Y}}(d\mathbf{z} \,\|\, \mathbf{y})$.
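
For a handful of processes, the selection of the maximum weight directed tree can be illustrated without implementing Chu and Liu's algorithm. The sketch below is an exhaustive search, not the efficient algorithm of [47]; it assumes the pairwise directed informations are already available in a dictionary keyed by ordered pairs, and the function names and numerical values are hypothetical. It tries every node as the root, as described above for implementations that need a root a priori.

```python
import itertools


def reaches_root(v, parent, root):
    """True if following parent pointers from v ends at root without a cycle."""
    seen = set()
    while v != root:
        if v in seen:
            return False
        seen.add(v)
        v = parent[v]
    return True


def best_directed_tree(m, di):
    """Exhaustive search for the maximum weight directed spanning tree.

    di[(u, v)] holds the directed information from process u to process v.
    Returns (weight, root, parent), where parent maps each non-root node
    to the node it is causally conditioned on.
    """
    best = (float("-inf"), None, None)
    for root in range(m):
        others = [v for v in range(m) if v != root]
        choices = [[u for u in range(m) if u != v] for v in others]
        for parents in itertools.product(*choices):
            parent = dict(zip(others, parents))
            if all(reaches_root(v, parent, root) for v in others):
                weight = sum(di[(parent[v], v)] for v in others)
                if weight > best[0]:
                    best = (weight, root, parent)
    return best


# Hypothetical directed informations for three processes X=0, Y=1, Z=2.
example_di = {
    (0, 1): 0.5, (1, 0): 0.1,
    (1, 2): 0.4, (2, 1): 0.05,
    (0, 2): 0.2, (2, 0): 0.0,
}
best_weight, best_root, best_parent = best_directed_tree(3, example_di)
```

With these values the search selects the chain X to Y to Z rooted at X. The search cost grows exponentially in m, so it is only suitable for checking small cases against an efficient implementation.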

VIII. Properties of causal dependence trees 

Now we will consider some of the differences between 
Chow and Liu dependence trees and causal dependence trees 
in terms of variable dependencies and storage requirements. 

A. Dependencies between variables 

Causal dependence trees have a simple graphical representation
for random processes, unlike Chow and Liu dependence
trees. For causal dependence trees, the processes, not the
variables, are represented by nodes. However, the dependencies
between variables induced by causal dependence trees can also
be graphically represented. An example showing dependencies
between variables for the full joint distribution, a Chow and
Liu dependence tree approximation, and a causal dependence
tree approximation, in a set of three random processes with
four timesteps, is depicted in Figure 7.

The graph of the variable dependence structure induced
by a causal dependence tree approximation (Figure 7(c)) is
not necessarily a tree. It is a structured subgraph of the
variable dependence structure of the full joint (Figure 7(a)). In
particular, a variable is allowed to have dependencies with all
of the previous variables in its process and those in the past
of the process being causally conditioned on. Consequently,
the induced subgraph of the variables from a single process,
such as $\{Y_1, Y_2, Y_3, Y_4\}$, forms a complete graph. In general,
the set of possible Chow and Liu dependence trees (any tree
on the variables) does not intersect with the set of possible
causal dependence trees. In the limiting case of n = 1, the
sets of possible trees are the same (see Corollary 3).

Even though the graph of variable dependencies for a causal
dependence tree is more complex than that of a Chow and
Liu dependence tree, it is significantly less complex than
that of a full joint distribution. Consider a network of m random
processes over n timesteps. There are mn variables total.
For the graph of dependencies between variables for the full
joint distribution, there are $\binom{mn}{2} = O(m^2 n^2)$ edges. The
Chow and Liu dependence tree has $mn - 1$, or $O(mn)$, edges.
The graph of dependencies between variables for a causal
dependence tree distribution has a complete graph for each
process ($m\binom{n}{2}$ edges), as well as k edges from a variable
with index k to the current and all of the previous k - 1
variables in the process being causally conditioned on. Since
there are m - 1 processes which are causally conditioned on
one other, there are

$$(m-1) \sum_{k=1}^{n} k = (m-1)\, \frac{n(n+1)}{2}$$

edges between variables of different processes. Consequently,
the causal dependence tree has

$$O\left(m \binom{n}{2} + (m-1)\, \frac{n(n+1)}{2}\right) = O(mn^2)$$

edges total. These extra dependencies (edges) allow causal
dependence trees to incorporate more dynamics of the system
that pertain to how the processes evolve depending on their
own past and possibly the past of other processes.
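
The three edge counts can be evaluated for the setting of Figure 7 (three processes, four timesteps). The helper below is a small sketch that simply evaluates the expressions stated above; the function name is an assumption made for illustration.

```python
from math import comb


def edge_counts(m, n):
    """Edges in the variable dependence graphs for m processes, n timesteps."""
    full_joint = comb(m * n, 2)   # complete graph on all mn variables
    chow_liu = m * n - 1          # any tree on mn variables
    # Complete graph within each process, plus the cross-process edges
    # for the m - 1 processes that are causally conditioned on another.
    causal_tree = m * comb(n, 2) + (m - 1) * n * (n + 1) // 2
    return {"full_joint": full_joint, "chow_liu": chow_liu,
            "causal_tree": causal_tree}


counts = edge_counts(3, 4)
```

For m = 3 and n = 4 this gives 66 edges for the full joint, 11 for a Chow and Liu tree, and 38 for a causal dependence tree.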



B. Storage requirements 

One of the significant aspects of using the original Chow 
and Liu algorithm is the reduction in storage needed for the 
approximation. We will now examine the reduction in storage 
for causal dependence trees. Let m denote the number of
processes, and n the length in time. There are mn variables
total. For simplicity, assume each variable is over a finite
alphabet of size $|\mathsf{X}| < \infty$. The full distribution requires
$O(|\mathsf{X}|^{mn})$ storage, since there are $|\mathsf{X}|^{mn}$ realizations, each
with a possibly unique probability.

The Chow and Liu algorithm approximates the full joint
with a product of second order distributions [11]. For example,
given a joint distribution on six random variables, $P_{X^6}(dx^6)$,
the Chow and Liu algorithm might approximate it as in
Figure 4 with the following:

$$\widehat{P}_{X^6}(dx^6) = P_{X_6}(dx_6)\, P_{X_1|X_6}(dx_1|x_6)\, P_{X_3|X_6}(dx_3|x_6)\, P_{X_4|X_3}(dx_4|x_3)\, P_{X_2|X_3}(dx_2|x_3)\, P_{X_5|X_2}(dx_5|x_2),$$

or another product of this form. Each second order distribution
requires $O(|\mathsf{X}|^2)$ storage, and there are $mn - 1$ of them,
one for each variable except the first, which has a first order
distribution. Thus, the total storage required for a Chow and
Liu dependence tree approximation is $O(mn|\mathsf{X}|^2)$, which is
linear in both the number of processes and time.

The causal dependence tree approximation has a much 
simpler graphical representation than the Chow and Liu 
approximation in the context of random processes. However, it 
places almost no restrictions on dependencies within each 
process, or between processes that are linked by causal 
conditioning. For example, consider three processes 
$\{\mathbf{X}, \mathbf{Y}, \mathbf{Z}\}$ with a causal tree approximation 

$$\widehat{P}_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = P_{\mathbf{X}}(dx)\, P_{\mathbf{Y}\|\mathbf{X}}(dy \,\|\, x)\, P_{\mathbf{Z}\|\mathbf{Y}}(dz \,\|\, y).$$



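This can be expanded into a product of conditional 
probabilities with increasing time 

$$\begin{aligned}
\widehat{P}_{\mathbf{X},\mathbf{Y},\mathbf{Z}}(dx, dy, dz) = \prod_{i=1}^{n} & P_{X_i|X^{i-1}}(dx_i | x^{i-1})\, P_{Y_i|Y^{i-1},X^{i}}(dy_i | y^{i-1}, x^{i}) \\
& \times P_{Z_i|Z^{i-1},Y^{i}}(dz_i | z^{i-1}, y^{i}).
\end{aligned}$$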

The final terms have many dependencies. A variable is 
allowed to depend on the full past of its own process and 
the process that it is causally conditioned upon. The storage 
for the whole causal tree approximation will be dominated 
by the storage required for these terms. For each of these 
$m - 1$ final terms (conditioned on the full past of two processes), 
$O(|\mathsf{X}|^{2n})$ storage is required, so the total storage necessary is 
$O(m|\mathsf{X}|^{2n})$. Thus, the storage for causal dependence trees is 
exponentially worse than that for Chow and Liu dependence 
trees, but exponentially better than storing the full joint 
distribution. 
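
To make the gap between the three storage orders concrete, the sketch below evaluates the leading-order expressions for one set of sizes. The function name `storage_orders` and the chosen values (binary alphabet, four processes, eight time steps) are assumptions for illustration only. The numbers are order-of-growth estimates, not exact table sizes.

```python
def storage_orders(alphabet_size: int, m: int, n: int) -> dict:
    """Leading-order storage (number of stored probabilities) for each model."""
    q = alphabet_size
    return {
        "full_joint": q ** (m * n),      # O(|X|^{mn})
        "chow_liu": m * n * q ** 2,      # O(mn |X|^2)
        "causal_tree": m * q ** (2 * n), # O(m |X|^{2n})
    }


# Hypothetical sizes: binary alphabet, 4 processes, 8 time steps.
sizes = storage_orders(alphabet_size=2, m=4, n=8)
print(sizes)
```

With these sizes the causal tree needs on the order of $2^{18}$ entries, against $2^{7}$ for the Chow and Liu tree and $2^{32}$ for the full joint distribution.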

IX. Example 

Let us illustrate the proposed algorithm with a binary 
hypothesis testing example. We construct two networks of jointly 
Gaussian random processes according to a generative model. 
Next, we apply the above procedure to form causal dependence 
tree approximations for both networks. Additionally, we apply 
the original Chow and Liu procedure to develop dependence 
tree approximations. Subsequently, data generated from 
the original distributions are used in binary hypothesis testing 
(using log likelihood ratios with a threshold parameter). The 
performance of the causal dependence tree approximations in 
binary hypothesis testing is compared to that of the original 
distributions and that of the Chow and Liu dependence trees. 

The formula to compute directed information from a random 
process $\mathbf{X}$ to a random process $\mathbf{Y}$, where $\mathbf{X}$ and $\mathbf{Y}$ are jointly 
Gaussian random processes each of length $n$, is: 

$$\begin{aligned}
I(\mathbf{X} \to \mathbf{Y}) &= \sum_{i=1}^{n} I(Y_i; X^{i} \mid Y^{i-1}) \\
&= \frac{1}{2} \sum_{i=1}^{n} \log \frac{|K_{Y^{i-1},X^{i}}|\,|K_{Y^{i}}|}{|K_{Y^{i-1}}|\,|K_{Y^{i},X^{i}}|},
\end{aligned}$$

where $|K_{Y^{i},X^{i}}|$ is the determinant of the covariance matrix for 
the variables $\{Y^{i}, X^{i}\}$. The last line follows from [39]. We 
will now construct the networks, and then use this formula 
to calculate causal dependence tree approximations for the 
networks. 
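
As a rough illustration of how the log-determinant form can be evaluated numerically, the sketch below computes the directed information from a joint covariance matrix. The function names and the index-list convention (`x_idx[k]` and `y_idx[k]` give the rows of $X_{k+1}$ and $Y_{k+1}$ in the covariance matrix) are assumptions made for this example, and the result is in nats.

```python
import numpy as np


def _logdet(K, idx):
    """Log-determinant of the sub-covariance K[idx, idx]; 0.0 for an empty set."""
    if len(idx) == 0:
        return 0.0
    _, value = np.linalg.slogdet(K[np.ix_(idx, idx)])
    return value


def gaussian_directed_information(K, x_idx, y_idx):
    """I(X -> Y) in nats for jointly Gaussian processes with joint covariance K."""
    x_idx, y_idx = list(x_idx), list(y_idx)
    total = 0.0
    for i in range(1, len(x_idx) + 1):
        x_now, y_now, y_past = x_idx[:i], y_idx[:i], y_idx[: i - 1]
        # 1/2 log( |K_{Y^{i-1},X^i}| |K_{Y^i}| / ( |K_{Y^{i-1}}| |K_{Y^i,X^i}| ) )
        total += 0.5 * (
            _logdet(K, y_past + x_now)
            + _logdet(K, y_now)
            - _logdet(K, y_past)
            - _logdet(K, y_now + x_now)
        )
    return total


# Toy check with n = 1: the sum reduces to the mutual information I(X_1; Y_1).
rho = 0.6
K = np.array([[1.0, rho], [rho, 1.0]])
di = gaussian_directed_information(K, x_idx=[0], y_idx=[1])
print(di, -0.5 * np.log(1.0 - rho**2))
```

Using `slogdet` rather than `det` avoids overflow or underflow when the covariance blocks are large.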

Let $\mathbf{X}^6$ denote six jointly Gaussian, zero-mean random 
processes. We specified two generative models, where each 
process at time $i$ was a linear combination of a subset of the 
recent past of the other processes plus independent Gaussian 
noise. Letting $X$ denote a column vector containing all the 
variables, and $N$ a column vector of independent normal noise, 
we specified the matrix $A$ in: 

$$X = AX + N.$$

To obtain the full covariance matrix for $X$, isolate $X$: 

$$X = (I - A)^{-1} N$$

and compute $K_X$. Data can be generated for $X$ by first 
generating $N$ and then linearly transforming the result. The 
generative model graphs (with directed arrows depicting the 
causal dependencies) are shown in Figures 8(a) and 8(b). 

Fig. 8. Graphs of the causal dependencies between the processes in the full 
joint distributions for the two generative models: (a) first generative 
model ($H_0$); (b) second generative model ($H_1$). The dependence structures 
are topologically similar. 

Fig. 9. Graphs of the causal dependence tree approximations for both of 
the generative model networks: (a) first generative model ($H_0$); 
(b) second generative model ($H_1$). Despite the topological similarities of 
the dependence graphs for the original distributions, these approximations 
are topologically distinct. 

We applied the procedure to these two networks of jointly Gaussian 
random processes, and the resulting causal dependence tree 
structures are depicted in Figures 9(a) and 9(b). We also 
used the Chow and Liu procedure to develop dependence tree 
approximations, computed with publicly available code [51]. The 
numbers of dependencies between variables for the full joint 
distribution, the causal dependence tree approximations, and 
the Chow and Liu dependence tree approximations were 1770, 
495, and 59, respectively. 
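
The linear model $X = AX + N$ described above can be simulated in a few lines. The sketch below is illustrative only. It assumes unit-variance noise, so that $K_X = (I - A)^{-1}(I - A)^{-\top}$, and a hypothetical strictly lower-triangular $A$, which guarantees that $I - A$ is invertible when variables are ordered in time. Neither the noise variance nor the actual matrices are given in the text.

```python
import numpy as np


def linear_gaussian_model(A, num_samples, rng):
    """Sample X solving X = A X + N with N ~ N(0, I) (unit variance assumed)."""
    d = A.shape[0]
    B = np.linalg.inv(np.eye(d) - A)           # X = B N
    K_X = B @ B.T                              # Cov(X) when Cov(N) = I
    N = rng.standard_normal((num_samples, d))  # one noise vector per row
    X = N @ B.T                                # row-wise version of X = B N
    return X, K_X


# Hypothetical toy matrix: each variable depends on all earlier ones.
d = 4
A = np.tril(np.full((d, d), 0.3), k=-1)
rng = np.random.default_rng(0)
X, K_X = linear_gaussian_model(A, num_samples=200_000, rng=rng)

# The sample covariance should be close to the analytic K_X.
empirical = np.cov(X, rowvar=False)
print(np.max(np.abs(empirical - K_X)))
```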

Next, we generated data 10000 times from both original 
distributions and performed binary hypothesis testing (using 
log likelihood ratios with a threshold $\tau$) with the original 
distributions, the causal dependence tree approximations, and 
the Chow and Liu dependence tree approximations. Figure 10 
depicts the corresponding ROC curves. The causal dependence 
tree approximations, despite the significant reduction in 
structure, still perform well in this task. Also, their performance is 
significantly better than that of the Chow and Liu dependence 
tree approximations. 
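
A log likelihood ratio test of this kind can be sketched for two zero-mean Gaussian models as follows. The covariances `K0` and `K1` are small hypothetical stand-ins, not the networks used in the experiment, and the helper names are assumptions. To evaluate an approximation instead of the true model, its own covariance would be substituted in the likelihood.

```python
import numpy as np


def gaussian_loglik(samples, K):
    """Log density of a zero-mean N(0, K) at each row of `samples`."""
    d = K.shape[0]
    _, logdet = np.linalg.slogdet(K)
    quad = np.einsum("ij,ij->i", samples @ np.linalg.inv(K), samples)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)


def log_likelihood_ratio(samples, K0, K1):
    """log p1(x) - log p0(x) for each row of `samples`."""
    return gaussian_loglik(samples, K1) - gaussian_loglik(samples, K0)


def roc_points(llr_h0, llr_h1, thresholds):
    """False-alarm and detection rates when deciding H1 if LLR > threshold."""
    pfa = np.array([(llr_h0 > t).mean() for t in thresholds])
    pd = np.array([(llr_h1 > t).mean() for t in thresholds])
    return pfa, pd


rng = np.random.default_rng(0)
K0 = np.array([[1.0, 0.8], [0.8, 1.0]])    # hypothetical covariance under H0
K1 = np.array([[1.0, -0.8], [-0.8, 1.0]])  # hypothetical covariance under H1
x0 = rng.multivariate_normal(np.zeros(2), K0, size=10_000)
x1 = rng.multivariate_normal(np.zeros(2), K1, size=10_000)

thresholds = np.linspace(-10.0, 10.0, 201)
pfa, pd = roc_points(
    log_likelihood_ratio(x0, K0, K1),
    log_likelihood_ratio(x1, K0, K1),
    thresholds,
)
```

Plotting `pd` against `pfa` traces one ROC curve. Repeating the computation with each approximation's likelihood gives the curves to compare.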
Fig. 10. ROC curves for the full distributions (top-most), the causal 
dependence tree approximations (middle), and the Chow and Liu dependence 
tree approximations (bottom-most) in binary hypothesis testing. 